N95-23688


Performance Results of Cooperating Expert Systems in a 
Distributed Real-time Monitoring System 

U. M. Schwuttke, J. R. Veregge, and A. G. Quan 

Jet Propulsion Laboratory 
California Institute of Technology 
4800 Oak Grove Drive 
Pasadena, CA 91109 U.S.A.

Tel: 818-354-1414 Fax: 818-393-6004

E-mail: ums@puente.jpl.nasa.gov 


KEY WORDS AND PHRASES 

Automation, distributed systems, expert 
systems, monitoring and diagnosis, real time. 

INTRODUCTION 

There are numerous definitions for real-time 
systems, the most stringent of which involve 
guaranteeing correct system response within a 
domain-dependent or situationally defined period 
of time. For applications such as diagnosis, in 
which the time required to produce a solution can 
be non-deterministic, this requirement poses a 
unique set of challenges in dynamic modification 
of solution strategy that conforms with maximum 
possible latencies. However, another definition
of real time is relevant for monitoring systems
in which failure to supply a response within the
allowed (and often very short) amount of time
does not make the solution less useful (or, as in
the extreme example of a monitoring system
responsible for detecting and deflecting enemy
missiles, completely irrelevant). This more
casual definition involves responding to data at
the same rate at which it is produced, and is more
appropriate for monitoring applications with
softer real-time constraints, such as interplanetary
exploration, in which massive quantities of data
travel at the speed of light for a number of hours
before they reach the monitoring system.

The latter definition of real time has been
applied to the MARVEL system [1] for automated
monitoring and diagnosis of spacecraft telemetry.
An early version of this system has been in 
continuous operational use since it was first 
deployed in 1989 for the Voyager encounter with 
Neptune. This system remained under incremen- 
tal development until 1991 and has been under 
routine maintenance in operations since then, 
while continuing to serve as an artificial intelli- 
gence (AI) testbed in the laboratory. A second- 
generation Galileo application has been on-line 
for only one year and is still under active devel- 
opment. The second-generation system builds 
on experience gained with the earlier embedded 
diagnosis systems to achieve an order of mag- 
nitude increase in processing capability. 

The system architecture has been designed to 
facilitate concurrent and cooperative processing 
by multiple diagnostic expert systems in a hierar- 
chical organization. The diagnostic modules 
adhere to concepts of data-driven reasoning, con- 
strained but complete nonoverlapping domains, 
metaknowledge of global consequences of anom- 
alous data, hierarchical reporting of problems 
that extend beyond a single domain, and shared 
responsibility for problems that overlap domains. 
The system enables efficient diagnosis of com- 
plex system failures in real-time environments 
with high data volumes and moderate failure 
rates, as indicated by extensive performance 
measurements. 

COOPERATING DIAGNOSIS SYSTEMS 
IN A DISTRIBUTED ARCHITECTURE 

The need for robust mechanisms of cooperation
among real-time diagnostic modules has been an
important driver of the system architecture. The
notion of joint responsibility [2] as an alternative
to the more conventional notion of agents acting
in self-interest [3], [4] has been amended with
modular problem decomposition and data-driven
reasoning in order to minimize the need for
communication between agents.

Figure 1. The distributed architecture on the left can currently be configured to run on one
to four UNIX workstations. The hybrid subsystem processes on the left are composed of
conventional and knowledge processes, as shown in the figure on the right. Knowledge
processes are used only when a reasoning capability is explicitly required.

The various modules in the distributed architec- 
ture of Figure 1 are allocated among a configura- 
tion of UNIX workstations. The data manage- 
ment module receives data from a source (in the 
case of our current application, the data is space- 
craft telemetry received from the Jet Propulsion 
Laboratory’s (JPL) ground data system) and allo- 
cates it to the appropriate subsystem monitor 
based on identification of data type. (Our system 
is partitioned according to the structure of the 
spacecraft, with one subsystem monitor for every 
spacecraft subsystem monitored by MARVEL, 
including command, flight data, attitude and 
articulation control, and telecommunications; 
propulsion, thermal, and power have not been 
addressed.) 
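
The allocation step performed by the data management module can be sketched as a lookup from data type to subsystem monitor. This is only an illustration of the idea, not the MARVEL implementation: the data-type keys, the monitor names, and the `allocate` function are hypothetical, and the choice to discard data for unmonitored subsystems is an assumption.

```python
from collections import defaultdict

# Hypothetical mapping from telemetry data type to the owning subsystem monitor.
ROUTES = {
    "ccs": "command",
    "fds": "flight_data",
    "aacs": "attitude_articulation_control",
    "telecom": "telecommunications",
}


def allocate(records, routes=ROUTES):
    """Group (data_type, value) records by the monitor that owns the data type."""
    queues = defaultdict(list)
    for data_type, value in records:
        monitor = routes.get(data_type)
        if monitor is None:
            # Assumption: data for unmonitored subsystems (e.g. thermal) is skipped.
            continue
        queues[monitor].append(value)
    return dict(queues)


if __name__ == "__main__":
    sample = [("ccs", 1), ("thermal", 2), ("fds", 3), ("ccs", 4)]
    print(allocate(sample))
```

Keeping the routing table separate from the monitors mirrors the partitioning by spacecraft subsystem described above: adding a monitor means adding one table entry.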

Each of the subsystem monitors provides 
algorithmic functions such as validation of 
telemetry, detection of anomalies, trend analysis, 
and automatic reporting. These functions, while 
not in themselves of interest in AI or computer 
science research, are vital components of a
real-world diagnostic system. In addition, each
subsystem process can provide diagnosis of 
failures based on anomalous data and recommen- 
dation of corrective actions. The latter two func- 
tions are provided by knowledge-based modules 
that are embedded within each of the individual 
subsystem monitors. The remaining modules in- 
clude the graphical user interface and display 
processes for each of the subsystem monitors, 
and the system-level diagnostic agent for 
handling failures that manifest themselves across 
multiple subsystems (and therefore cannot be 
completely analyzed by any one subsystem 
alone). Detailed reasoning examples that 
illustrate cooperation among diagnosis modules 
are presented elsewhere [5].

EXPERT SYSTEM CHARACTERISTICS 

Rule-based diagnostic modules are embedded 
in efficient algorithmic code. The algorithmic 
code performs all functions that do not explicitly 
require reasoning capability, so that the use of the 
less efficient reasoning modules is limited to 
those functions for which it is essential. 

Forward-chaining demons are used to repre- 
sent domain knowledge. Reasoning is activated 
by the appearance of data that requires diagnosis. 
The initial determination that diagnosis is
required is made by algorithmic monitoring code,
which detects potential anomalies
and passes the anomalous data to an appropriate 
diagnostician. In the absence of anomalous data 
within its domain, a diagnostic system is idle. 
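
A minimal sketch of this data-driven arrangement follows, assuming a simple limit check as the algorithmic monitor and a small fact-based forward chainer standing in for the commercial shell. The parameter name, limits, and rules are hypothetical and exist only to make the example run.

```python
def forward_chain(facts, rules):
    """Return the new facts derived by firing rules until nothing changes."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known - set(facts)


def monitor(samples, limits, rules):
    """Check each sample algorithmically; reason only about out-of-limit data."""
    diagnoses = set()
    invocations = 0
    for name, value in samples:
        low, high = limits[name]
        if low <= value <= high:
            continue  # nominal data: the diagnostician stays idle
        invocations += 1
        diagnoses |= forward_chain({f"{name}_out_of_limits"}, rules)
    return diagnoses, invocations


# Hypothetical limits and rules, for illustration only.
LIMITS = {"signal_level": (-160.0, -120.0)}
RULES = [
    (frozenset({"signal_level_out_of_limits"}), "downlink_degraded"),
    (frozenset({"downlink_degraded"}), "check_antenna_pointing"),
]

if __name__ == "__main__":
    data = [("signal_level", -130.0), ("signal_level", -170.0)]
    print(monitor(data, LIMITS, RULES))
```

The invocation counter makes the cost model visible: the rule base is searched once per anomalous sample, not once per sample.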

Each diagnostic system is responsible for a 
small, clearly partitionable domain of expertise. 
Partitioning is governed by the natural decomposi- 
tion of the system being diagnosed. This helps 
overcome disadvantages associated with rule- 
based systems for which, typically, implementa- 
tion can be intractable, execution is nondetermi- 
nistic and relatively slow, and verification can be 
difficult. Small, modular knowledge bases enable 
developers to handle more easily definable sub- 
problems. Smaller knowledge bases execute 
more efficiently, because less time is spent in 
search. Finally, smaller knowledge bases are eas- 
ier to verify. 

Each diagnostician has sufficient knowledge 
to be fully accountable for diagnoses within its 
area and has no knowledge of other domains.
This requires that accountability for locally
detectable failures be local. However, the
participation of more than one diagnostic system 
is required when symptoms manifest themselves 
in more than one domain. Each diagnostic system 
has the necessary metaknowledge to identify 
symptoms of failures that could possibly extend 
beyond its domain. Metaknowledge is contained 
in a set of rules in each knowledge base, and is 
associated with the occurrence of events whose 
analysis may require the cooperation of other 
agents. 

An expert forwards all known information 
pertaining to failures beyond its domain to anoth- 
er agent at the next higher level in the hierarchy. 
The underlying approach on forwarded messages 
is conservative; it is up to the agent receiving the 
information to determine whether a fault requiring 
a diagnostic message and an alarm has occurred, 
or whether the anomalous data has some other 
explanation. When necessary, metaknowledge is 
used to direct messages to the relevant agent(s) in 
order to complete the final analysis of the anoma- 
lous data and provide diagnosis of any associated 
failures. 
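
The reporting scheme can be illustrated with a two-level sketch in which each subsystem agent diagnoses what it can locally and forwards symptoms flagged by its metaknowledge to a system-level agent. All class names, symptom names, and diagnoses are hypothetical. The rule that the system-level agent raises an alarm only when two or more subsystems report the same symptom is an assumption chosen to show the conservative handling of forwarded messages, not a policy taken from MARVEL.

```python
class SystemAgent:
    """System-level agent that collects forwarded symptoms from subsystems."""

    def __init__(self):
        self.reports = {}  # symptom -> set of reporting subsystem names

    def receive(self, sender, symptom):
        self.reports.setdefault(symptom, set()).add(sender)

    def alarms(self, min_reporters=2):
        # Assumed policy: a forwarded symptom alone does not imply a fault.
        return sorted(
            symptom
            for symptom, senders in self.reports.items()
            if len(senders) >= min_reporters
        )


class SubsystemAgent:
    """Subsystem agent with local rules plus metaknowledge for escalation."""

    def __init__(self, name, local_rules, escalate, parent):
        self.name = name
        self.local_rules = local_rules  # symptom -> local diagnosis
        self.escalate = escalate        # symptoms that may extend beyond this domain
        self.parent = parent

    def handle(self, symptom):
        if symptom in self.escalate:
            self.parent.receive(self.name, symptom)
        return self.local_rules.get(symptom)


if __name__ == "__main__":
    system = SystemAgent()
    ccs = SubsystemAgent("CCS", {"checksum_error": "memory fault"}, {"timing_slip"}, system)
    telecom = SubsystemAgent("Telecom", {}, {"timing_slip"}, system)
    ccs.handle("timing_slip")
    telecom.handle("timing_slip")
    print(system.alarms())
```

Neither subsystem agent refers to the other; all cross-domain knowledge lives in the escalation sets and in the parent, which keeps communication between agents to a minimum.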

EXPERIMENTAL RESULTS 

The distributed architecture described in this 
paper has been applied to two generations of real- 
time monitoring systems. The Galileo system, 
currently under development, does not yet include 
on-line modules for diagnosis. The Voyager
system, completed in 1991, contains four
diagnostic expert systems (developed using a 
commercial shell) in a two-level hierarchy. 

Conventional monitoring modules for four 
of the spacecraft subsystems were completed: 
the flight data subsystem, the computer 
command subsystem, the attitude and articula- 
tion control subsystem, and the telecom sub- 
system. Three of the expert systems are embed- 
ded in conventional modules that provide data 
access/manipulation and monitoring in addition 
to providing graphical user interfaces and other 
subsystem-specific automation. The system- 
level diagnostician is not embedded within 
another module. 

The computer command subsystem (CCS) 
expert contains on the order of 150 rules, focuses 
on a relatively broad domain analysis, and is 
invoked very frequently (for almost every para- 
meter). The attitude and articulation control 
subsystem (AACS) expert contains approxi- 
mately 100 rules, and focuses on a more narrow 
domain of analysis. It is invoked infrequently. 
The telecom expert system contains on the order
of twenty-five rules and is invoked continuously 
(for every parameter). The flight data subsystem 
(FDS) module does not contain an expert 
system. 

Experimental evaluation on a network of 
workstations (Sun Microsystems Sparc LXs
running Solaris 2.2) involved a series of tests to 
determine the maximum number of data parame- 
ters that could be processed per module per 
second (a subsystem module includes both the 
conventional and knowledge-based components, 
as shown in Figure 1). The primary purpose of 
this evaluation was to learn about the perfor- 
mance of the expert systems and apply our 
insights to future development on the Galileo 
application. This evaluation was not motivated 
by a need to improve the performance of the 
Voyager system, as current data rates are consid- 
erably slower than during the planetary 
encounters and are easily handled by the existing 
software configuration. 

The results are shown in Figure 2. The base- 
line performance was below expectation, with 
FDS, CCS, AACS, and Telecom processing 26, 
3, 24, and 428 parameters per second respective- 
ly, or 481 total parameters per second processed 
by the entire system. Performance profiling
revealed that file input/output (I/O) and the
graphical user interfaces (GUIs), rather than the
diagnostic modules, were the primary performance
bottlenecks.

Figure 2. Performance results for each of the subsystem modules.

With regard to these bottlenecks, the four 
modules can be categorized as follows: FDS and 
CCS have moderately complex GUIs, and
perform significant file I/O. AACS has the most 
complex GUI and performs very little file I/O, 
because the input files read by this subsystem are 
sufficiently small that they are read entirely into 
memory upon system initialization. Telecom has 
a simple GUI and performs no file I/O. 

Optimizing file I/O where possible improved 
performance to 53, 16, 81, and 428 parameters 
per second. (This is the only improvement 
discussed in this section that was carried forward 
to the operational system.) Simplifying the 
graphical user interface by eliminating real-time 
scrolling windows (known to be computationally 
inefficient in MOTIF user interfaces; considered 
desirable by end-users and thus included in the 
FDS, CCS, and AACS modules of the opera- 
tional system) further improved performance to 
53, 35, 172, and 428 parameters per second. 
Eliminating the graphical user interface entirely 
resulted in further performance increases to 67, 
35, 646, and 570 parameters per second. Finally, 
eliminating the expert systems yielded
performance of 67, 273, 668, and 570 parameters
per second.
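
The reported throughput figures can be tabulated to compare configurations directly. The sketch below uses only the numbers quoted in the text; the configuration labels and the helper names `total` and `speedup` are ours, and each configuration is assumed to be cumulative with the ones before it, as the text describes.

```python
MODULES = ("FDS", "CCS", "AACS", "Telecom")

# Parameters per second, as reported in the text for each configuration.
RATES = {
    "baseline": (26, 3, 24, 428),
    "optimized file I/O": (53, 16, 81, 428),
    "no scrolling windows": (53, 35, 172, 428),
    "no GUI": (67, 35, 646, 570),
    "no GUI, no expert systems": (67, 273, 668, 570),
}


def total(config):
    """Sum of the per-module rates for one configuration."""
    return sum(RATES[config])


def speedup(module, start, end):
    """Ratio of a module's rate in configuration `end` to its rate in `start`."""
    i = MODULES.index(module)
    return RATES[end][i] / RATES[start][i]


if __name__ == "__main__":
    for config in RATES:
        print(f"{config:28s} total = {total(config)}")
    ratio = speedup("CCS", "no GUI", "no GUI, no expert systems")
    print(f"CCS gain from removing the expert system: {ratio:.1f}x")
```

For the baseline, `total` returns 481, which matches the system-wide figure given above. The CCS ratio between the last two configurations is 273/35, or about 7.8.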

These results made it possible to gain a num- 
ber of new insights with regard to our system. 
The biggest surprise was the high performance of 
the telecom module. The combination of the 
small knowledge base and the simple user inter- 
face enables processing of 428 parameters per 
second. Elimination of both the GUI and the ex- 
pert system only results in a further performance 
improvement on the order of 25 percent, indica- 
ting that no substantial penalty is associated with 
the significant enhancement to functionality pro- 
vided by these two components of the module. 
The next generation of the system will benefit 
from this result, in that frequently performed 
analysis that requires the use of an expert system 
will be implemented with a number of small, 
cooperating modules rather than one larger 
module. This in itself is not unexpected; it is the 
magnitude of the benefit that was surprising. 
Further performance improvement could likely 
be gained with a more efficient expert system 
shell. This will be investigated, although we do 
not currently expect more than an additional 
order of magnitude improvement. 

The AACS expert system is larger by a factor
of four, and slower, in the worst case, by over 
two orders of magnitude. This can be explained 
by a significantly larger search space and greater 
depth in each search. Performance could likely 
be improved with a faster reasoning shell and by 
modularization of the knowledge base. However, 
the diagnostic component of this module is 
invoked sufficiently rarely (often less than once 
per hour) that this is not an important bottleneck. 
In the case of this type of module, it is preferable 
to simplify the GUI, which continues to impose 
considerable resource overhead. 

The CCS expert system is large and is 
invoked regularly as part of ongoing trend
analysis in that subsystem module. Elimination of
the expert system increases performance by nearly
an order of magnitude (from 35 to 273 parameters
per second), providing
further indication that a large knowledge base is 
inappropriate for frequently invoked real-time 
diagnosis. The CCS knowledge base is charac- 
terized by breadth rather than depth. As a result, 
it would be both beneficial and straightforward
to divide it into three or more component modules
without imposing significant overhead from 
resulting interprocess communication. (If this 
were implemented, the CCS module would still 
be I/O bound, as it reads from a number of very 
large files.) 

As a result of these insights, the Galileo 
implementation takes a more efficient approach 
to file I/O. It also tends to be more efficient in its 
graphical user interface, in that it does not include 
some of the higher overhead user interface 
widgets. Such changes impact functionality, 
requiring a certain amount of negotiation with 
end users (who are typically willing to compro- 
mise in favor of performance). In addition, the 
Galileo system makes greater use of the distribut- 
ed architecture with more than one module per 
subsystem, and more than one diagnostic compo- 
nent per module. 

CONCLUSION 

The MARVEL distributed architecture 
demonstrates the successful implementation of 
multiple cooperating agents in a complex real- 
time diagnostic system. We have designed an 
architecture that facilitates concurrent and coop- 
erative processing by multiple agents in a hier- 
archical organization. These agents adhere to the 
concepts of data-driven embedded diagnosis,
constrained but complete nonoverlapping
domains, metaknowledge of global consequences 
of anomalous data, hierarchical reporting of 
problems that extend beyond an agent’s domain, 
and shared responsibility for problems that 
overlap domains. 

The MARVEL architecture is simple and 
well suited for real-time telemetry analysis. 
Conventional processing is used wherever possi- 
ble in order to facilitate performance. The 
knowledge-based agents are embedded within 
the algorithmic code, and are invoked only when 
necessary for diagnostic reasoning. Distribution 
of telemetry monitoring and diagnostic processes 
across workstations provides significant 
improvement in performance. These qualities 
allow for efficient real-time diagnosis of 
anomalies occurring in a complex application. 

Maximum modularization of frequently 
invoked reasoning modules will enable signifi- 
cant performance improvements in the next 
generation system. 